Data Importing and Data Cleaning

Data importing : Importing the relevant libraries. The basic libraries could alternatively be imported with a single command using the pyforest library.

Data importing : Importing the raw bike ride CSV files

Data importing : Importing the raw power generation CSV files

Data importing : Importing the weather data for the dates in the data using an API from the website https://www.worldweatheronline.com/. The API key was purchased for personal use. This request should not be re-run, as the user is charged for each weather call.

Data importing : The weather data is saved as a .csv file in the directory as 'London.csv'.

Data cleaning : The data has been imported and cleaned

Data importing : Importing the Holidays data

Data cleaning : Filtering the dataframe on the features required for the analysis.

Data cleaning : Cleaning and filtering the data on a daily, monthly and yearly basis, and by other criteria
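As an illustration of aggregating rides on a daily, monthly and yearly basis, here is a minimal sketch using only the standard library. The timestamps, their format and the names `ride_starts` and `count_rides` are hypothetical and are not taken from the notebook, which most likely does this step with pandas.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw ride records: one start timestamp per ride.
ride_starts = [
    "2019-01-05 08:15",
    "2019-01-05 17:40",
    "2019-01-06 09:05",
    "2019-02-01 12:30",
    "2020-02-01 07:55",
]


def count_rides(timestamps, key_format):
    """Count rides per period; key_format is a strftime pattern for the period."""
    counts = Counter()
    for stamp in timestamps:
        started = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        counts[started.strftime(key_format)] += 1
    return dict(counts)


daily_rides = count_rides(ride_starts, "%Y-%m-%d")
monthly_rides = count_rides(ride_starts, "%Y-%m")
yearly_rides = count_rides(ride_starts, "%Y")
```

The same function covers all three granularities because only the period key changes.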

Data cleaning : Renaming the columns

Checking the data types of the features :

Data Cleaning : Cleaning the data for Daily Bike Ride Data

Data Cleaning : Filtering daily data based on features needed for the analysis.

EXPLORATORY DATA ANALYSIS (EDA) :

Analysis : Mean rides over the years

Analysis: Mean rides over the months

Analysis: Mean rides based on Weekdays and Weekends

Analysis: Mean rides over the day of the week

Analysis: Bike rides over the month compared on the basis of Weekdays and Weekends

Analysis : Distribution of Rides

Analysis: Bike rides over the week compared on the basis of the months of the year

Machine Learning

Machine learning has been used in the following sections to predict the number of bike rides. The datasets used for the prediction are:

1. TFL data only
2. TFL data merged with weather data and holiday data
3. TFL data merged with Power Generation data and holiday data
4. TFL data merged with weather data, Power Generation data and holiday data

As this analysis deals with a regression problem, five different machine learning algorithms have been used:

1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor
4. AdaBoost Regressor
5. Neural Network

Finally, ARIMA and Seasonal ARIMA (SARIMA) have been used for the prediction of bike rides.

The evaluation metrics of each model are listed at the end. The metrics considered for this analysis are Mean Absolute Error, Mean Squared Error, Root Mean Squared Error and Mean Absolute Percentage Error, chosen because they are standard metrics for evaluating regression models. A scatter plot of the test values against the predicted values is also shown for each machine learning model. The closer the points lie to the y=x line, the more accurate the model's predictions are.
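The four metrics named above can be sketched in plain Python as follows. The function name `regression_metrics` and the sample values are hypothetical. In the notebook these numbers would more likely come from a library such as scikit-learn.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        # MAPE divides by the true value, so it is undefined for zero targets.
        raise ValueError("MAPE is undefined when a true value is zero")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical test values and predictions (number of rides).
y_test = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
metrics = regression_metrics(y_test, y_pred)
```

MAE and RMSE are in the same unit as the target (rides), while MAPE is scale-free, which makes it easier to compare across datasets of different volume.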

1. Machine Learning with only 'TFL' data

One Hot Encoding for Time and Weekday/Weekend Feature
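One-hot encoding replaces a categorical column with one 0/1 indicator column per category. The sketch below shows the idea in plain Python. The helper `one_hot` and the sample values are hypothetical, and the notebook may well use a library encoder instead.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


# Hypothetical Weekday/Weekend feature for three observations.
day_type = ["Weekday", "Weekend", "Weekday"]
columns, encoded = one_hot(day_type, "day_type")
# columns -> ['day_type_Weekday', 'day_type_Weekend']
# encoded -> [[1, 0], [0, 1], [1, 0]]
```

For linear models, one of the indicator columns is often dropped to avoid perfectly collinear features.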

Train Test Split

The data has been split into a training set and a testing set. The model will be trained on the training set and then the test set will be used to evaluate the model.
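A shuffled train/test split can be sketched as below. The 20% test fraction, the seed and the function name `shuffled_split` are assumptions made for illustration. The source does not state the split ratio that was used.

```python
import random


def shuffled_split(rows, test_fraction=0.2, seed=42):
    """Randomly assign a fraction of rows to the test set, reproducibly."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Hypothetical observations, represented here by their row numbers.
rows = list(range(10))
train_rows, test_rows = shuffled_split(rows, test_fraction=0.2)
```

Fixing the seed keeps the evaluation comparable across the different models trained on the same data.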

Linear Regression

Decision Tree

Random Forest

AdaBoost Regressor

Neural Network

2. Machine Learning with TFL Data, Weather Data and Holiday Data

Merging 'weather_data' to 'daily_data' as 'final_data'

Merging with the weather data adds a number of features to the data. While some of them look very important, others do not appear to have much relevance (for example, moon_illumination). To leave nothing to chance, the feature correlations were plotted and only features with a high correlation were considered for the analysis.
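The correlation-based filtering described above can be sketched as follows. The threshold of 0.5, the use of the absolute Pearson correlation, the column name `maxtempC` and all sample values are assumptions for illustration. The source does not state the cut-off that was applied.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treat as uninformative
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold):
    """Keep the features whose absolute correlation with the target meets the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]


# Hypothetical daily values; 'maxtempC' is an illustrative column name.
rides = [10.0, 20.0, 30.0, 40.0, 50.0]
features = {
    "maxtempC": [5.0, 9.0, 15.0, 19.0, 26.0],
    "moon_illumination": [50.0, 10.0, 90.0, 10.0, 50.0],
}
selected = select_features(features, rides, threshold=0.5)
```

In this toy data the temperature column tracks the ride counts closely and is kept, while the moon illumination column is uncorrelated and is dropped.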

Observation:

From the correlation heatmap, it can be seen that the Number of Bicycle Hires is positively correlated with:

  1. 'ST_YEAR'
  2. 'Weekday/Weekend'
  3. 'Max Temperature'
  4. 'Minimum Temperature'
  5. 'Sun hour'
  6. 'UV Index'
  7. 'Feel Like C'
  8. 'Heat Index'
  9. 'Wind Chill'
  10. 'Pressure'
  11. 'Temperature'
  12. 'Visibility'
  13. 'Workday'
  14. 'Non Work_day'

One Hot Encoding :

Linear Regression

Decision Tree

Random Forest

AdaBoost Regressor

Neural Network

3. Machine Learning with TFL Data, Power Generation Data and Holiday Data

Similar to the earlier analysis, the correlation with the Power Generation data has been checked to select the features that are important for the analysis.

Observation:

From the correlation heatmap, it can be seen that the Number of Bicycle Hires is positively correlated with:

  1. 'ST_YEAR'
  2. 'Weekday/Weekend'
  3. 'NI Int'
  4. 'Dutch Int'
  5. 'Biomass'
  6. 'Net Pumped'
  7. 'Hydro'
  8. Holiday data

Linear Regression

Decision Tree Regressor

Random Forest Regressor

AdaBoost Regressor

Neural Network

4. Machine Learning with TFL Data, Power Generation Data, Weather Data and Holiday Data

Merging weather, power generation and holiday data into a new dataframe 'final_data_w_p_h'

From the correlation heatmap, it can be seen that the Number of Bicycle Hires is positively correlated with:

  1. 'ST_YEAR'
  2. 'Weekday/Weekend'
  3. 'NI Int'
  4. 'Dutch Int'
  5. 'Biomass'
  6. 'Net Pumped'
  7. 'Hydro'
  8. 'Holiday data'
  9. 'Max Temperature'
  10. 'Minimum Temperature'
  11. 'Sun hour'
  12. 'UV Index'
  13. 'Feel Like C'
  14. 'Heat Index'
  15. 'Wind Chill'
  16. 'Pressure'
  17. 'Temperature'
  18. 'Wkday_Wend'
  19. 'Non Work_day'

Linear Regression

Decision Tree Regressor

Random Forest Regressor

AdaBoost Regressor

Neural Network

Time Series Analysis

Train Test Split

The train/test split for the time series analysis has not been done in the traditional manner using the train_test_split() function, which shuffles the observations by default. Instead, the data is split in chronological order, which preserves the continuity of the series and the autocorrelation among the observations.
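A split that preserves the time order can be sketched as below: the earliest observations form the training set and the most recent ones form the test set. The function name `chronological_split`, the test size and the sample counts are hypothetical.

```python
def chronological_split(series, test_size):
    """Split an ordered series so the last `test_size` points form the test set."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    split_at = len(series) - test_size
    return series[:split_at], series[split_at:]


# Hypothetical daily ride counts, already sorted by date.
daily_counts = [120, 135, 150, 110, 160, 170, 145, 155, 165, 140]
train, test = chronological_split(daily_counts, test_size=3)
# train -> [120, 135, 150, 110, 160, 170, 145]
# test  -> [155, 165, 140]
```

Because no shuffling takes place, the model is always evaluated on observations that come after everything it was trained on.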

ARIMA MODELLING

Fit an ARIMA(5,1,5) Model
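In the ARIMA(p, d, q) notation, p is the number of autoregressive lags, d the order of differencing and q the number of moving-average lags, so ARIMA(5,1,5) uses five autoregressive lags, one round of differencing and five moving-average lags. The sketch below illustrates only the d=1 step in plain Python, with hypothetical values and helper names. The model fitting itself would normally be left to a time series library and is not shown here.

```python
def difference(series):
    """First-order differencing: the change between consecutive observations."""
    return [current - previous for previous, current in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Invert first-order differencing, given the first original value."""
    restored = [first_value]
    for diff in diffs:
        restored.append(restored[-1] + diff)
    return restored


# Hypothetical daily ride counts with an upward trend.
rides = [100, 104, 103, 110, 115]
diffs = difference(rides)                 # [4, -1, 7, 5]
restored = undifference(rides[0], diffs)  # back to the original series
```

Differencing removes a trend in the level of the series, and forecasts made on the differenced scale are converted back to ride counts by the inverse step.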

Evaluate the Model

SARIMA MODELLING

End of Notebook